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Abstract 

Diagonalization of a large matrix is the computational bottleneck in many applications such as electronic 
structure calculations. We show that a speedup of over 30% can be achieved by exploiting 32-bit float- 
ing point operations, while keeping 64-bit accuracy. Moreover, most of the computationally expensive 
operations are performed by level-3 BLAS/LAPACK routines in our implementation, thus leading to 
optimal performance on most platforms. Further improvement can be made by using problem-specific 
preconditioners which take into account nondiagonal elements. 

Keywords: diagonalization, eigenvalues, electronic structure calculations, mixed precision, conjugate 
gradient method 



1. Introduction 

Matrix diagonalization plays an important role in many areas of science and engineering. In electronic 
structure calculations of complex systems, for instance, most of the computational effort is spent on the 
numerical solution of the Schrodinger equation at various levels of approximation. This procedure is 
equivalent to an eigenvalue problem in which only a small subset of the eigenvalues of a large symmetric 
matrix is desired [H 0, Q • To this end, iterative diagonalization is more favorable than direct methods 
[3| which are designed for calculating all eigenvalues of dense matrices. 

Numerical calculations on modern computers are generally performed using 64-bit double precision 
(DP) floating-point numbers, which are accurate to 15-16 significant digits. On the other hand, 32- 
bit single precision (SP) floating-point numbers with 7-8 digits of accuracy are more efRcient in terms of 
computational cost, memory usage, network bandwidth, and disk storage [5j- In recent years, considerable 
effort has been made to obtain the results with DP accuracy at the expense of SP operations. In particular, 
a variety of mixed precision algorithms have been developed for the solution of linear equations [^, 0] . 

Similarly, several researchers have attempted to solve eigenvalue problems with mixed precision algo- 
rithms in the past 0, @] . Unfortunately, most of these approaches correspond to direct methods for dense 
matrices, and thus are of limited use in electronic structure calculations. In this paper, we show how 
to exploit the mixed precision arithmetic for iterative solution of large-scale eigenvalue problems, with 
special emphasis on electronic structure calculations. 



2. Theory 

2.1. Trace minimization method 

Let H be a real, symmetric matrix of dimension N . The eigenvalues and eigenvectors of T-L are defined 

by 

He, -A,e„ i = l,2,---,iV, (1) 

where Ai < A2 < ... < Ajv, and ei, e2, • • • , e^f form a set of orthonormal vectors. We also assume that % 
is a sparse matrix with 0{N) nonzero entries. 
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Our aim is to calculate the m lowest eigenvalues of % and the corresponding eigenvectors, where 
1 <C m ^ A^, and typically, m = 10^-10'^ and N = 10^-10® in electronic structure calculations hOf. To 
be precise, it is often sufficient to calculate only the sum of the eigenvalues. 



i=i 



(2) 



and the subspace spanned by ei,e2, 
system [l|, [3, 



TO, where Eq corresponds to the ground-state energy of the 

(3) 



Hereafter we assume the presence of a gap in the spectrum, i.e.. 



which is satisfied in nonmetallic systems H, 0, 0| ■ 

The trace minimization method [1, [H i is based on the fact that if 



i;(ci,C2, • • • ,c™) 



is minimized subject to the orthonormality conditions 



T 



E = Eq holds at the minimum, and C — (ci C2 • 
matrix form, Eqs. @ and ([5]) can be written as 



= l,2,---,m, 
■ c„i) spans the same subspace as {ei, 62, ' 



E{C) = trace [C'^UC) 



and 



(4) 
(5) 

S?n } ■ In 

(6) 
(7) 

respectively. These equations are invariant under any unitary transformation of C . 

Although we focus on the standard eigenvalue problem of Eq.([T]) in this work, the extension to the 
gen eralized eigenvalue problem {%& — \Se) is straightforward if 5 is a symmetric, positive definite matrix 
This property allows us to use nonorthogonal basis functions (13 [l3, 14, 3| with ease. Moreover, 
the trace minimization method is equally valid even if % depends on the eigenvectors of % itself, as 
explained in §9.4.3.4 of Ref.Jl. 

In FiglU we illustrate the numerical implementation of the trace minimization method in which E{C) 
is minimized directly with respect to C using the nonlinear conjugate gradient method 3 XL, 3l- This 
procedure is often referred to as a direct energy minimization in electronic structure community[l|, [3, 01 • 
Here C,G,P <^ R^^™, and a line minimization is performed along Pi to determine Ui 1^ 23- Moreover, 
7i is given by 

i^Q 



7^ = \ trace((G', -G,-!)^^^ 
trace(Gf_iG,_i) 



1,2, 



(8) 



following the Polak-Ribiere formula [l9| . 

When ||Gi||, the Frobenius norm of G^, is sufficiently small, Ei will be equal to Eq, and 

'HC = CH 



(9) 



will hold. Diagonalization of H £ R^mxm ^fJ^\\ g^ve the explicit eigenvalues and eigenvectors of if 
necessary. The number of iterations to reach convergence is estimated by 21 1 



Alitor OC 



1 



(10) 



gap 



Therefore, a naive implementation of the trace minimization method fails in the limit of a vanishing gap. 
In this case, more complex (and thus more costly) algorithms 22, 2^ should be employed to avoid the 
slow convergence. 



2 



Choose initial guess Co subject to CqCq = Im 
for i = 0,1,2,... 

Calculate E, = E{C^) and G, = -VE{C^) 
check convergence; continue if necessary 
Pi = G,+ 7,P^_i 
C'i+i = Ci + aiPi 
Orthonormalize (C^+i) 

end 



Figure 1: Trace minimization method using the nonlinear conjugate gradient method. 



Calculate E{C) and G = -\7E{G) 
Algorithm- 1: 

(1) X^HC 

(2) H = C^X 

(3) E = trace(ff) 

(4) G = -2{X - GH) 



DGEMM 



0{mN) 
o\m^N) 

aim) 

DGEMM olw?N) 



Algorithm-2: 

(1) x = nG 

(2) Hu = diag(C^X) 

(3) E = tracc(i/D) 

(4) Quit if G is unnecessary 

(5) X' = X- CHu 

(6) H' = G^X' 

(7) G = -2{X' - GH') 



DGEMM 
DGEMM 



0{mN) 
0{mN) 
0{m) 

0{mN) 

0{m^N) 

0{m^N) 



Figure 2: Two (mathematically) equivalent procedures for calculating the energy and gradient. The second and third 
columns show the corresponding BLAS/LAPACK routines (if any) and their computational costs. H, H' , Hj^ are symmetric 
matrices G R™^™, and X,X' G R'^^'". 



The procedure for calculating the energy and the gradient is illustrated in Figl2] Algorithm-1 is a 
naive implementation which is shown only for illustrative purposes. Algorithm-2 is a more practical 
implementation which is appropriate for incorporating the mixed precision arithmetic explained in ^12. 21 
Since H' is a symmetric matrix of dimension m, only the upper (or lower) triangular part needs to be 
calculated explicitly in step-(6). In particular, when m is large, H' should be divided into smaller blocks, 
as shown in FiglH with each block being calculated by a separate DGEMM (or SGEMM) call. 

Similarly, the orthonormalization procedure is shown in FiglH Strictly speaking, explicit orthonor- 
malization of G is inconsistent with the conjugate gradient method, which is designed for unconstrained 
minimization problems. In our experience, however, only minor effects are seen when m <^ N is satisfied. 

Since all 0{m?N) operations introduced in this section are performed by level-3 BLAS routines, 
optimal performance is achieved on most platforms [23] • 

2.2. Mixed precision arithmetic 

Here we present several ideas for improving the performance of the algorithm introduced in ^ 12. II 
by taking advantage of inexpensive SP operations. The first approach is to perform all floating-point 
operations in DP (full DP), which will serve as a reference in the following. On the other hand, it is 
also possible to replace all DP operations by SP (full SP), which is expected to achieve the largest gain 
in terms of computational cost and memory usage. Unfortunately, as will be shown below, this approach 
does not provide sufficient accuracy, and thus is of limited use in real applications. 

A more practical approach is to start from full DP, and incorporate SP operations progressively. 
Mixed precision variant- 1 (MPl) is a conservative approach which aims at achieving a reasonable gain, 
while keeping full DP accuracy. 

(i) The change of Ci, ||Ci+i — Ci||, in Fig[Tl will become much smaller than ||Ci|| as convergence is 
approached. Therefore, Gi and Pi can be stored in memory in SP format without sacrificing accuracy. 
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Figure 3: (a) Recursive and (b) row-wise splitting of tlie upper triangular part of a symmetric matrix. 

Orthonormalize (Cin): 

(1) 5 = C^Cin DSYRK Oim^N) 

(2) Calculate L, where S = LL'^ DPOTRF 0{m^) 

(3) Calculate DTRTRI ©(m^) 

(4) Cout = C'inL-^ DTRMM 0{m^N) 

Figure 4: Orthonormalization procedure based on the Cholesky factorization. Same notation as in Fig[2] 5 is a symmetric 
matrix of dimension m, and L is a lower triangular matrix of dimension m (so is L~^). On exit, Cin is overwritten by Cout- 



Here, the elements of Gi are first calculated in DP to minimize round-off errors in FiglU followed by 
conversion to SP. Conversion between SP and DP is performed with the Fortran intrinsic functions Sngl 
and Dble. While this change alone leads to only a minor performance gain, a large gain can be made 
when used in conjunction with advanced preconditioners, as will be discussed in 21 Furthermore, a 
significant reduction of memory usage is expected. 

(ii) Similarly, the change of Ci during the orthonormalization procedure shown in Fig|3] decreases from 
iteration to iteration. Therefore, after calculating L~^, we introduce a DP matrix 

Ld = diag(L-i), (11) 

and an SP matrix 

L' = L-^-Lu, (12) 
where Ld — > Im and L' — > as convergence is approached. Then, the last step can be rewritten as 

C'out = CinLD + CinL', (13) 

where the first and the second terms are calculated in DP and SP, respectively. Obviously, the former 
cost is negligible, while the latter can be performed with an STRMM call instead of DTRMM. The final 
result, Cout, is stored in DP format. 

In addition to the changes noted above, further acceleration is achieved in the mixed precision variant- 
2 (MP2) by reducing the cost of evaluating the gradient in FiglH Algorithm-2, as follows. Since the 
computational cost of this procedure is dominated by steps-(l), (6), and (7), the two DGEMM calls 
in steps- (6) and (7) are replaced by SGEMM. The rest of the operations in this procedure, including 
step-(l), are performed in DP, which guarantees DP accuracy of the energy. Step-(5), corresponding to 
self-orthogonalization of the gradient, is also performed in DP, which allows us to reduce the round-off 
errors arising from the use of SGEMM in steps- (6) and (7) significantly. It is also preferable to set 
diag(i?')=0 explicitly after step-(6) to minimize the errors. Nevertheless, MP2 has the potential risk of 
obtaining inaccurate gradient when close to convergence. 
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Figure 5: Performance of matrix-matrix multiplications in SP (SGEMM) and DP (DGEMM). 



When MP2 is used, all 0{m?N) operations except the DSYRK call in the orthonormalization proce- 
dure are performed in SP. The performance and accuracy of these algorithms are compared in the next 
section. 



3. Numerical results 



All calculations shown here were performed on a single core of the 2.3 GHz AMD Opteron 6176 SE 
rocessor with CentOS 5.5 operating system, gfortran 4.1.2 compiler, and GotoBLAS2 numerical library 



prO ' 

m 



We first show the performance of matrix-matrix multiplications in FiglSl The SP routine (SGEMM) 
is found to be 1.85-1.95 times faster than the corresponding DP routine (DGEMM) when the matrix size 
is larger than 200. 

Then, we compare the performance of full DP, MPl, MP2, and full SP for calculating the sum of the 
m lowest eigenvalues of large sparse matrices corresponding to the two-dimensional discrete Laplacian 
under Dirichlet boundary conditions. 
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The dimension N is given by = n'^, where n denotes the grid size in each direction. The eigenvalues 
of "Hat are given explicitly by 



A 



4 sin-" 



2(n+ 1) 



2{n+l] 



p,q = 1,2, ...,n. 



(16) 



After sorting {Xp^q} in ascending order, we can calculate the exact value of Eq for any value of m. 
The above Hamiltonian represents free electrons confined in a square box, which is essentially a gapless 
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Figure 6: The eigenvalue distribution ol Hn ('Eg. lfTit ) for N = 192^ 



Table 1: Performance of the four algorithms for iterative diagonalization of "Hjv. Mtcr denotes the number of iterations 
required to achieve R < 10~^^ in the full DP run. All measurements are given in units of seconds per iteration. The 
numbers in parentheses indicate the percentage relative to full DP. 



N 


m 


Eg 


fgap 




iVitcr 


Full DP 


MPl 


MP2 


Full SP 


96^ 


220 


35.25 


6.2x10" 


-a 


270 


0.631 


0.591(94%) 


0.491(78%) 


0.419(66%) 


1922 


220 


8.991 


1.9x10" 


-3 


630 


2.520 


2.357(94%) 


1.968(78%) 


1.680(67%) 


1922 


534 


50.90 


2.5x10" 


-3 


560 


12.04 


11.05(92%) 


8.772(73%) 


7.958(66%) 


1922 


1064 


196.8 


3.4x10" 


-3 


460 


45.38 


41.01(90%) 


30.52(67%) 


28.04(62%) 


1922 


1519 


395.6 


2.7x10" 


-3 


422 


89.84 


79.16(88%) 


60.53(67%) 


57.23(64%) 



system, as shown in Figl6l Therefore, this problem is a stringent test for the trace minimization method 
which requires the presence of an energy gap. 

In Table [T] we show the results of iterative diagonalization for several pairs of (iV, m) , following the 



algorithm presented in ^2.1\ Here, the values of m were chosen to satisfy egap > 0. For simplicity, the 
symmetry of H' was not exploited, and the initial guess Co was generated from random numbers, followed 
by orthonormalization. In Fig [71 we show the convergence of the residual at iteration i, defined by 



El — Eg 
Eg 



(17) 



These results suggest that the performance of the four algorithms satisfies 

Full SP > MP2 > MPl > Fuh DP, 

where the differences tend to increase with m, but not with N . In particular, full SP is 30 - 40 % faster 
than full DP, which is consistent with the results for matrix multiplications. However, the accuracy of 
the results obtained from full SP is insufficient for most applications. Therefore, full SP should be used 
only for generating the initial guess p5| . 

In contrast, MPl retains full DP accuracy for all values of {N,m) shown in Table [1] while showing 
only a modest gain of « 10 %. 

MP2 is found to achieve a gain of over 30 % for large to, while keeping near DP accuracy. However, 
the accuracy of the converged solution deteriorates slowly with to, because the use of SP is not always 
appropriate for evaluating the gradient. We have found that if subspace rotation is performed occasionally 
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Figure 7: Convergence of the residual for N = 192^: (a) m = 220, (b) m = 534, (c) m = 1064, and (d) m = 1519. 

to diagonalize H, this problem can be overcome, thus leading to full DP accuracy. Alternatively, one can 
simply switch from MP2 to MPl (or full DP) algorithm when the residual is below a g iven tolerance. 
The latter approach is preferable in terms of the construction of conjugate directions [19], as well as the 
extrapolation of the initial guess [26] . 

4. Discussion 

In §2.1) we illustrated the implementation of the trace minimization method using the basic conjugate 
gradient method. In this section, we show how to improve the convergence rate of the conjugate gradient 
method by a linear transformation of the gradient 

G' = MG. (18) 

Here, M e '^nxn jg ^ symmetric, positive definite matrix called a preconditioner, which should be a 
reasonable approximation to the inverse Hessian (See §11 of Ref.j^]). Preconditioning allows us to reduce 
the number of iterations to reach convergence at the expense of increased cost per iteration. In electronic 
structure calculations, this generally leads to an increase in 0{mN) cost and a decrease in 0{m^N) 
cost. Therefore, the choice of the preconditioner becomes more important for larger applications. In 
plane-wave-based electronic structure calculations, it is a common practice to use a diagonal matrix for 
M, which leads to a considerable reduction of the number of iterations [17] at a negligible cost. 
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However, more elaborate preconditioners have been developed beyond the diagonal approximation 
for plane waves ^27*1, atomic orbitals [i^l, and real space basis sets 23, 3^, 311- These preconditioners 
significantly improve the convergence rate, while the computational cost of evaluating the right-hand 
side of Ea. ([T5|) will become non- negligible. Since SP is generally sufficient for representing G, as already 
mentioned in ^2.2\ the same will hold for M, unless M is an ill-conditioned matrix. Therefore, the matrix 
multiplication of Eq.(|18|) can also be performed in SP, thus minimizing the overhead of preconditioning. 



The idea behind this approach is similar in spirit to the solution of linear equations discussed in Ref. [5| . 
These advanced preconditioners will be particularly beneficial when highly accurate methods [s^l are 
used for the electronic structure calculations. 

Although we have focused on the conjugate gradient method, the basic idea presented in this work 
should be equally valid for other iterative algorithms 33, 34, [s^, [s^ 37 1. In our electronic structure 



code Femteck t38i] , the trace minimization method is used in conjunction with the limited memory 
BFGS (Broyden-Fletcher-Goldfarb-Shanno) method [H, \M, EH, E^] , which requires practically no line 
minimization. When the MP2 algorithm is used, together with a multigrid approximation to the inverse 
Hessian (sij , the computational time for the ground-state calculation is reduced by a factor of about two 
in large-scale applications [lo| , compared with the full DP calculation using a diagonal approximation to 
the Hessian (4l|,|42|. If we start from a reasonable initial guess [26], the ground-state energy is obtained 
in 10-15 iterations in nonmetallic materials [9, 10]. 



5. Conclusion 

We have shown in this work that iterative solution of the eigenvalue problem can be accelerated 
significantly by taking advantage of mixed precision arithmetic without relying on external devices. Even 
further improvement can be made by using problem-specific preconditioners which take into account 
nondiagonal elements. These methods will become more important with increasing problem size. 

For most applications, MP2 is a good compromise between accuracy and computational cost. When 
highly accurate eigenvalues/ vectors are desired, we recommend to switch from MP2 to MPl (or full DP) 
at some point before the convergence slows down. 

While the current implementation is designed for ground-state electronic structure calculations, it 
would also be interesting to investigate the use of mixed precision arithmetic in other problems such as 
the density- functional perturbation theory 43, 44, 4^. 
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